
Conversation

@gabemontero
Contributor

This replaces #3594 for dev and stage

I've initially placed a hold on this, as I want to get sign-off from multiple sources before merging.

This is our second attempt at fixing the watcher memory leak. It also addresses the intermittent failures to store logs caused by a cancelled context.
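
For context, a minimal sketch of one common way in Go 1.21+ to decouple log storage from the incoming request's cancellation; the function and record names here are hypothetical and not the actual results code:

```go
package main

import (
	"context"
	"log"
	"time"
)

// storeLogs is a hypothetical stand-in for the call that persists a record's logs.
func storeLogs(ctx context.Context, record string) error {
	// ... write the log payload to the results API / object storage ...
	return nil
}

// handleRecord derives a storage context that keeps the request's values but not
// its cancellation, and gives the upload its own deadline, so an early return on
// the request side no longer cancels an in-flight log write.
func handleRecord(reqCtx context.Context, record string) {
	storeCtx, cancel := context.WithTimeout(context.WithoutCancel(reqCtx), 5*time.Minute)
	defer cancel()
	if err := storeLogs(storeCtx, record); err != nil {
		log.Printf("log storage for %s failed: %v", record, err)
	}
}

func main() {
	handleRecord(context.Background(), "example-record")
}
```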

This also brings in the ability to profile the api server with pprof, though it is not enabled at this time.

Since we had to change the threading model for log storage in order to fix the results watcher memory leak, this is more than the normal pipeline-service bump to pick up the new results images: I have cherry-picked those commits and am combining them with our initial increase in k8s client tuning and worker threads for both the api server and the watcher.

I AM AVOIDING DATABASE LEVEL TUNING FOR NOW, SAVING THAT FOR A SEPARATE EFFORT

Aside from adjusting for the memory leak fix, the worker thread and k8s client tuning have proven to greatly improve controller latency for the core tekton controller and the chains controller. The hope is this will help the watcher as well, both when storing and when pruning.
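
To make the kind of tuning meant here concrete, a minimal sketch assuming client-go (whose defaults are QPS=5, Burst=10); the numbers and the worker-count variable are placeholders, not the values this PR ships:

```go
package main

import (
	"fmt"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}

	// Raise the client-side rate limits so a busy controller is not throttled
	// by client-go's defaults of QPS=5 / Burst=10. Placeholder values.
	cfg.QPS = 50
	cfg.Burst = 100

	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	_ = client

	// More worker goroutines pulling items off the reconcile workqueue means
	// more PipelineRuns processed concurrently; "workers" is a placeholder for
	// whatever knob the controller exposes.
	workers := 8
	fmt.Println("would start", workers, "workers")
}
```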

Also, during the results outage, I saw some semblance of leader-election-based HA going on, where prior testing had only ever shown backup/restore behavior. Until sharding can be properly tested, I don't want to chance enablement based on replica start-up timing.

Lastly, I want to start moving away from https://github.com/openshift-pipelines/pipeline-service/
As such, I'm replicating the api server config map generator entirely here. If this becomes a point of contention that I cannot convince others about, I'll create a PR afterward to add config.env deltas to the pipeline-service repo, but I don't want to delay this change more than needed.

@enarha @savitaashture @khrm @hugares @stuartwdouglas PTAL

@sayan-biswas @avinal @ramessesii2 .... given the importance of this change, I would ask you all PTAL as well when you have cycles. Thanks.

@gabemontero
Contributor Author

gabemontero commented Apr 22, 2024

Forgot to mention @enarha @savitaashture @khrm @hugares @stuartwdouglas @sayan-biswas @avinal @ramessesii2

Per https://go.dev/doc/diagnostics#profiling and the "Can I profile my production services?" section, I am considering just enabling profiling, but without CPU profiling, in case we ever have to exec into the api server or watcher and get a goroutine dump.

If you have opinions there please advise.

I plan on discussing in the arch call as well.
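
For reference, the net/http/pprof handlers only collect and serve data when an endpoint is actually requested, and CPU profiling in particular only runs while /debug/pprof/profile is being fetched, so exposing them does not turn on continuous CPU profiling. A minimal sketch (the port choice is an assumption):

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* on http.DefaultServeMux
)

func main() {
	// Serve the pprof endpoints on localhost only; profiles are snapshots taken
	// on request, and the CPU profiler runs only while /debug/pprof/profile is
	// being fetched.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	select {} // stand-in for the api server / watcher main loop
}
```

With that in place, a goroutine dump could be pulled without restarting the pod, e.g. by port-forwarding to the container and requesting `/debug/pprof/goroutine?debug=2`.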

@gabemontero
Contributor Author

You probably already saw, @pmacik, but the e2e's are clean and this at least deploys now.

@gabemontero
Contributor Author

/test appstudio-e2e-tests

@gabemontero
Contributor Author

Still green @enarha @pmacik ... going to pull this in through openshift-pipelines/pipeline-service#994

…to one to match prod

Since we had to change the threading model for log storage in order to fix the results watcher memory leak, rather than just the normal pipeline-service bump to pick up the new results images, I have picked those commits and am combining them with our initial increase in k8s client tuning and worker threads for both the api server and watcher.

Aside from the memory leak, the worker thread and k8s client tuning have proven to greatly improve controller latency with the core tekton controller and chains controller. The hope is this will help the watcher as well, both when storing and when pruning.

Also, during the results outage, I saw some semblance of leader-election-based HA going on, where prior testing had only ever shown backup/restore behavior. Until sharding can be properly tested, I don't want to chance enablement based on replica start-up timing.

rh-pre-commit.version: 2.2.0
rh-pre-commit.check-secrets: ENABLED
@gabemontero force-pushed the tune-results-stage branch from e4e903c to d821cc9 on May 7, 2024 13:50
@enarha
Contributor

enarha commented May 7, 2024

@gabemontero Just to understand the end goal here, do we want to eventually promote those changes to prod or is it the intention to only use that in stage?

@gabemontero
Contributor Author

> @gabemontero Just to understand the end goal here, do we want to eventually promote those changes to prod or is it the intention to only use that in stage?

eventually these results fixes / tuning absolutely need to go into prod @enarha

just being super careful given the relative fragility of results

Ideally, we complete the perf testing @pmacik agreed to take on,
then we see if any stress testing can happen in stage, otherwise we just let it run there for a few days minimally,
then we promote to prod and watch results there like a hawk for at least a day or two.

@gabemontero
Contributor Author

ok still green ... going to try some local results testing

@gabemontero
Contributor Author

Works for me locally as well @enarha @pmacik ... none of the x509 cert errors that @pmacik saw a couple of weeks ago on his test cluster ... I'm going to re-read https://issues.redhat.com/browse/KONFLUX-2096 and see if I can run his test against my cluster while he is busy with RHDH.

@gabemontero
Contributor Author

OK, e2e's pass even with the enablement of pprof in the watcher (see my Apr 30 konflux community presentation https://docs.google.com/document/d/18Rt4VQh3MH7_O_FdkpmRp8-PacOyGPq4WCxT4Jm-ZD0/edit#heading=h.jpnstykbavio on why we are doing this, at least for the interim).

Testing locally to confirm basic results still works, as well as being able to get a thread dump if needed.

@gabemontero
Contributor Author

Local testing looks good: see https://issues.redhat.com/browse/KFLUXBUGS-863?focusedId=24697433&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-24697433

If Pavel cannot get to S3-based testing soon, I may move forward with this.

@gabemontero force-pushed the tune-results-stage branch from 24aa628 to ce3b40a on May 9, 2024 17:38
@gabemontero
Contributor Author

/test appstudio-hac-e2e-tests

@gabemontero
Contributor Author

I've also been gathering staging data to compare with after this merges.

Even on the much less used stage envs, for example, I'm seeing the results watcher queue depth go over 10.

My stress testing on my personal cluster can equal or beat that.

@gabemontero
Contributor Author

gabemontero commented May 10, 2024

Created 1000 simple pipelineruns (each with 2 taskruns) starting just after 4 PM US Eastern Friday May 10 on stage-m01 over a 50 minute timespan in my tenant (yes I pushed up against the for-gmontero-deployments and compute-build quotas :-) ). A sketch of how such a batch might be generated is included after the observations below.

Observations:

  • my personal test cluster created the same amount without error; on stage-m01 I got 251 of the cancelled-context errors we'll be fixing once this merges and promotes to prod (plus 61 instances of one of the other known bugs we are still working upstream)
  • the watcher work queue depth on my test cluster was typically 10 or fewer; on stage-m01 it got to over 2200
  • even with the ill-fated launching of grpc calls onto a separate goroutine, latency was around 8 to 9 on stage-m01 vs. anywhere from 9 to 20 in my test cluster, but minio vs. real S3 could have come into play there

I'll get a comparison on stage-m01 after this PR merges. Aside from node size, minio vs. S3 is a key difference.
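
A rough sketch of how such a batch might be generated, assuming the tektoncd/pipeline v1 clientset and an existing two-task Pipeline; the namespace and Pipeline name below are placeholders, not taken from this PR:

```go
package main

import (
	"context"
	"log"

	pipelinev1 "github.com/tektoncd/pipeline/pkg/apis/pipeline/v1"
	tektonclient "github.com/tektoncd/pipeline/pkg/client/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client, err := tektonclient.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	ns := "stress-test" // placeholder namespace
	for i := 0; i < 1000; i++ {
		pr := &pipelinev1.PipelineRun{
			ObjectMeta: metav1.ObjectMeta{GenerateName: "stress-", Namespace: ns},
			Spec: pipelinev1.PipelineRunSpec{
				// "simple-pipeline" stands in for a Pipeline with two Tasks.
				PipelineRef: &pipelinev1.PipelineRef{Name: "simple-pipeline"},
			},
		}
		if _, err := client.TektonV1().PipelineRuns(ns).Create(context.Background(), pr, metav1.CreateOptions{}); err != nil {
			log.Printf("create %d failed: %v", i, err)
		}
	}
}
```

In practice the creates would need some pacing; the run described above was spread over roughly 50 minutes and still pushed up against the namespace quotas.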

@openshift-ci

openshift-ci bot commented May 10, 2024

[APPROVALNOTIFIER] This PR is APPROVED

Approval requirements bypassed by manually added approval.

This pull-request has been approved by: gabemontero

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@gabemontero
Contributor Author

/hold cancel

@openshift-merge-bot merged commit 8b62b98 into redhat-appstudio:main on May 10, 2024
@gabemontero deleted the tune-results-stage branch on May 11, 2024 00:51
gabemontero added a commit to gabemontero/infra-deployments that referenced this pull request on May 11, 2024
@gabemontero
Contributor Author

So the minio vs. S3 difference had far more of the opposite effect than anticipated ... log storage was much slower with the simple stress test going to S3 than it had been with the same code/settings going to minio.

I'm about done gathering some debug data, and then I will revert this PR with #3688.

After I have had a chance to analyze the data, I'll report back on next steps: more threads, larger timeouts, or whether it is time for a total redesign or a pivot to kubearchive.
